3527: Avro-4193: [c] segfault cannot handle empty byte array#14

Closed
martin-augment wants to merge 2 commits into main from
pr-3527-2025-10-27-11-12-13

@martin-augment
Owner

3527: To review by AI

When avro-c tries to encode an empty byte array it calls avro_malloc(0),
which on some architectures returns NULL.  Make sure this is not
interpreted as an error or dereferenced, causing a segfault.
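
The distinction described above can be sketched in a few lines of C. This is a minimal illustration using plain malloc rather than avro_malloc; the helper names alloc_failed and alloc_bytes are hypothetical and do not come from the avro-c sources.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: true only when an allocation of `size` bytes
 * really failed. A NULL result for size == 0 is not treated as a
 * failure, because malloc(0) is allowed to return NULL. */
static bool alloc_failed(const void *ptr, size_t size)
{
    return ptr == NULL && size != 0;
}

/* Allocate `size` bytes and report success or failure through *ok. */
static void *alloc_bytes(size_t size, bool *ok)
{
    void *p = malloc(size);   /* may legitimately be NULL when size == 0 */
    *ok = !alloc_failed(p, size);
    return p;
}
```

A caller that receives NULL with ok set to true has an empty buffer, not an error, and must avoid dereferencing it.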
@coderabbitai

coderabbitai bot commented Oct 27, 2025

Caution

Review failed

Failed to post review comments

Walkthrough

This pull request extends Avro's language bindings with logical type support (decimal, duration, uuid, date, time, timestamp), strengthens PHP type safety via comprehensive type hints, updates C library dependencies, and refines binary encoding to omit NUL terminators from JSON strings.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CI and Build Configuration: .github/workflows/test-lang-php.yml, composer.json | Adds composer.json to workflow triggers, extends PHP test matrix to versions 8.1–8.4, bumps Composer from 2.2.5 to 2.8.12, and updates PHP requirement from ^7.1 \|\| ^8.0 to ^7.3 \|\| ^8.0. |
| C Library Foundation: lang/c/CMakeLists.txt, lang/c/src/datum.c, lang/c/src/value-json.c | Updates Jansson minimum version from 2.3 to 2.7, refines memory allocation error reporting to only signal failures for non-zero sizes, and reworks JSON encoding to remove NUL terminators from output strings. |
| C Library Tests: lang/c/tests/test_avro_data.c | Hardens test allocator free logic, extends coverage for empty bytes scenarios with new test cases, and updates test data to reflect zero-byte changes. |
| PHP Encoder and Decoder Type System: lang/php/lib/Datum/AvroIOBinaryEncoder.php, lang/php/lib/Datum/AvroIODatumReader.php, lang/php/lib/Datum/AvroIODatumWriter.php, lang/php/lib/DataFile/AvroDataIOWriter.php | Adds explicit type hints and void return types to public methods, introduces decimal encoding via writeDecimal(), refactors reader methods for strict typing, adds dedicated writer helpers (writeBytes(), writeArray(), writeMap(), etc.), and supports new logical types in binary encoding. |
| PHP Logical Types Implementation: lang/php/lib/Datum/Type/AvroDuration.php (new), lang/php/lib/Schema/AvroLogicalType.php (new) | Introduces AvroDuration class for serializing 12-byte duration payloads (months, days, milliseconds) with validation, and new AvroLogicalType factory class supporting decimal, uuid, date, time, timestamp, and duration logical types. |
| PHP Schema Factories and Constants: lang/php/lib/Schema/AvroSchema.php, lang/php/lib/Schema/AvroFixedSchema.php, lang/php/lib/Schema/AvroPrimitiveSchema.php, lang/php/lib/Schema/AvroRecordSchema.php | Adds logical type constants and parsing for decimal, uuid, date, time-millis, timestamp-millis, duration, and local-timestamp variants; introduces factory methods on fixed/primitive schemas; validates decimal precision; handles logical type serialization in toAvro(); and refines primitive field parsing. |
| PHP Test Suite: lang/php/phpunit.xml, lang/php/test/DatumIOTest.php, lang/php/test/IODatumReaderTest.php, lang/php/test/SchemaTest.php | Disables forced coverage annotations, expands test coverage with data-driven tests for decimal and duration logical types, adds testRecordWithLogicalTypes() and validation for invalid logical type parameters, updates test method signatures with void returns and type hints, and migrates to modern PHP array syntax. |

@claude

claude bot commented Oct 27, 2025

Pull Request Review

This PR merges three separate commits addressing bug fixes in C and feature additions in PHP:

  • AVRO-4136: JSON encoding of byte arrays containing null bytes (C)
  • AVRO-4190: Logical types support (PHP)
  • AVRO-4193: Empty byte array segfault fix (C)

Overall Assessment

The changes are generally well-implemented with appropriate test coverage. However, there are some concerns and suggestions for improvement.


C Code Changes

Positive Aspects

  1. Proper null byte handling (lang/c/src/value-json.c):

    • Correctly uses json_stringn_nocheck() instead of json_string_nocheck() to handle embedded null bytes
    • Removes the NUL terminator from the UTF-8 encoding, which was causing premature string truncation
    • The fix properly calculates buffer size without the terminator
  2. Empty bytes fix (lang/c/src/datum.c:156-157, 199-200):

    • Properly handles the edge case where avro_malloc(0) returns NULL on some architectures
    • The condition !bytes_copy && size correctly distinguishes between allocation failure and zero-size allocation
  3. Jansson version requirement (lang/c/CMakeLists.txt:182):

    • Bumping jansson requirement from 2.3 to 2.7 is appropriate as json_stringn_nocheck() was added in jansson 2.7

Concerns

CRITICAL - Memory Safety Issue in lang/c/src/value-json.c:65:

uint8_t  *dest8 = (uint8_t *) avro_malloc(utf8_len);
if (dest8 == NULL && utf8_len) {

The condition dest8 == NULL && utf8_len means that when utf8_len == 0, a NULL pointer is accepted without error. However, lines 71-76 then dereference curr, which could be NULL:

uint8_t  *curr = dest8;
for (i = 0; i < src_len; i++) {
    if (src8[i] & 0x80) {
        *curr++ = (0xc0 | (src8[i] >> 6));  // Could crash if dest8/curr is NULL
        *curr++ = (0x80 | (src8[i] & 0x3f));

Recommendation: Initialize dest8 to an empty string or handle the zero-length case explicitly before the loop.
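
One way to make the zero-length case explicit is sketched below. It is a self-contained approximation of the encoding step, not the code in value-json.c: the name bytes_to_utf8 and its signature are assumptions. Note that if the loop bound is src_len, as in the excerpt above, the loop body would not run for empty input, so the early return mainly documents the intent.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch: treat each input byte as a code point in
 * 0x00-0xFF and emit it as UTF-8, with no trailing NUL.
 * Returns 0 on success, -1 on allocation failure. */
static int bytes_to_utf8(const uint8_t *src, size_t src_len,
                         uint8_t **dest, size_t *dest_len)
{
    size_t i, utf8_len = 0;

    /* First pass: bytes >= 0x80 need two output bytes. */
    for (i = 0; i < src_len; i++)
        utf8_len += (src[i] & 0x80) ? 2 : 1;

    *dest = NULL;
    *dest_len = 0;
    if (utf8_len == 0)
        return 0;              /* empty input: nothing to allocate or write */

    uint8_t *out = malloc(utf8_len);
    if (out == NULL)
        return -1;

    /* Second pass: write the encoded bytes. */
    uint8_t *curr = out;
    for (i = 0; i < src_len; i++) {
        if (src[i] & 0x80) {
            *curr++ = (uint8_t) (0xc0 | (src[i] >> 6));
            *curr++ = (uint8_t) (0x80 | (src[i] & 0x3f));
        } else {
            *curr++ = src[i];
        }
    }
    *dest = out;
    *dest_len = utf8_len;
    return 0;
}
```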

Minor - Test allocator fix (lang/c/tests/test_avro_data.c:43):
The guard if (ptr) is good defensive programming for the zero-size free case.


PHP Code Changes

Positive Aspects

  1. Logical types implementation - Comprehensive support for:

    • Decimal (bytes and fixed)
    • Duration
    • Proper schema validation
  2. Type hints added - Many methods now have proper type hints (e.g., AvroSchema, void return types), improving code quality

  3. Test coverage - Extensive tests added in DatumIOTest.php, IODatumReaderTest.php, and SchemaTest.php

Concerns

Medium - Decimal precision validation (lang/php/lib/Datum/AvroIOBinaryEncoder.php:195-200):

$maxValue = 10 ** $precision;
if (abs($value) >= $maxValue) {

This validation is incorrect. For decimal with precision=5, the max value should be 99999 (5 nines), but 10 ** 5 = 100000. The check should be > not >=, or better yet, check against (10 ** $precision) - 1.
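
The boundary can be cross-checked with integer arithmetic. The sketch below is in C rather than PHP, and the helper names pow10_i64 and fits_precision are hypothetical. For integers, rejecting abs(value) >= 10^p and rejecting abs(value) > 10^p - 1 select the same values, whereas a bare > against 10^p would let 100000 through for precision 5, so the existing comparison is worth re-checking before it is changed.

```c
#include <stdbool.h>
#include <stdint.h>

/* 10^precision as an integer; assumes 0 <= precision <= 18 so the
 * result fits in int64_t. */
static int64_t pow10_i64(int precision)
{
    int64_t r = 1;
    while (precision-- > 0)
        r *= 10;
    return r;
}

/* Hypothetical check: does `unscaled` fit in `precision` decimal digits?
 * Comparing against both bounds avoids calling abs() on INT64_MIN. */
static bool fits_precision(int64_t unscaled, int precision)
{
    int64_t limit = pow10_i64(precision);   /* e.g. 100000 for precision 5 */
    return unscaled > -limit && unscaled < limit;
}
```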

Low - Inconsistent error handling (lang/php/lib/Datum/AvroIOBinaryEncoder.php:186):

if (!is_numeric($decimal)) {
    throw new AvroException("Decimal value '{$decimal}' must be numeric");
}

Consider using a more specific exception type if available, or at least document the expected exception hierarchy.

Low - Potential rounding issues (lang/php/lib/Datum/AvroIOBinaryEncoder.php:190-192):

$value = $decimal * (10 ** $scale);
if (!is_int($value)) {
    $value = (int) round($value);
}

Using round() may introduce precision errors for large decimals. Consider using bcmul() and bcadd() for arbitrary precision arithmetic if BC Math is available.
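
The underlying idea, avoiding floating point when scaling, can be illustrated independently of BC Math. The C sketch below converts decimal text to an unscaled integer using only integer arithmetic; parse_unscaled is a hypothetical helper, it rejects input that would need rounding, and it omits overflow checks to stay short.

```c
#include <ctype.h>
#include <stdint.h>

/* Hypothetical helper: convert text such as "-12.5" to the unscaled
 * integer for a given scale (scale 2 gives -1250) using integer math.
 * Returns 0 on success, -1 on malformed input or too many fraction
 * digits. Overflow checking is omitted to keep the sketch short. */
static int parse_unscaled(const char *s, int scale, int64_t *out)
{
    int negative = 0, frac_digits = 0, seen_digit = 0, seen_point = 0;
    int64_t value = 0;

    if (*s == '-' || *s == '+')
        negative = (*s++ == '-');

    for (; *s != '\0'; s++) {
        if (*s == '.' && !seen_point) {
            seen_point = 1;
        } else if (isdigit((unsigned char) *s)) {
            if (seen_point && ++frac_digits > scale)
                return -1;          /* would need rounding: reject */
            value = value * 10 + (*s - '0');
            seen_digit = 1;
        } else {
            return -1;              /* second '.', or a non-digit */
        }
    }
    if (!seen_digit)
        return -1;

    /* Pad with zeros up to the requested scale. */
    for (; frac_digits < scale; frac_digits++)
        value *= 10;

    *out = negative ? -value : value;
    return 0;
}
```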

Low - Magic number (lang/php/lib/Datum/AvroIODatumReader.php:353):

if (strlen($encodedDuration) !== 12) {

Consider defining a constant like DURATION_FIXED_SIZE = 12 for better maintainability.
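
For context on where the 12 comes from: the Avro specification describes a duration as a 12-byte fixed holding three little-endian unsigned 32-bit integers (months, days, milliseconds). A C sketch of the same length check with a named constant follows; the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { DURATION_FIXED_SIZE = 12 };   /* three 4-byte fields */

struct duration {
    uint32_t months;
    uint32_t days;
    uint32_t millis;
};

/* Read one little-endian unsigned 32-bit integer. */
static uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
           ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Returns 0 on success, -1 if the payload is not exactly 12 bytes. */
static int decode_duration(const uint8_t *buf, size_t len,
                           struct duration *out)
{
    if (len != DURATION_FIXED_SIZE)
        return -1;
    out->months = read_u32_le(buf);
    out->days   = read_u32_le(buf + 4);
    out->millis = read_u32_le(buf + 8);
    return 0;
}
```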

Code style - Mixed return types (lang/php/lib/Datum/AvroIOBinaryEncoder.php:55):

public function writeNull($datum): void
{
    return;  // Unnecessary explicit return in void function
}

While not wrong, explicit return; in void functions is redundant. Either remove it or be consistent.


Workflow & Configuration Changes

Positive

  • PHP version matrix expanded (.github/workflows/test-lang-php.yml):
    • Added PHP 8.1, 8.2, 8.3, 8.4 testing
    • Composer version updated to 2.8.12
    • PHP minimum version bumped from 7.1 to 7.3 (7.1 is EOL)

Concerns

Low - Composer version pinning:
Consider using composer:2 or composer:latest instead of pinning to 2.8.12 to automatically get security updates.


Test Coverage

Strengths

  1. C tests properly cover:

    • Empty byte arrays
    • Byte arrays with embedded null bytes (0x00)
    • Fixed types with null bytes
  2. PHP tests extensively cover:

    • Decimal logical type (bytes and fixed)
    • Duration logical type
    • Logical type validation
    • Round-trip serialization/deserialization

Gaps

  • No explicit test for the edge case where jansson < 2.7 is used (though CI should catch this via CMake error)
  • PHP decimal overflow scenarios could use more edge case testing (max precision values)

Security Concerns

  1. Critical: Potential null pointer dereference in lang/c/src/value-json.c when encoding zero-length bytes (see above)
  2. Low: No input validation on decimal precision/scale values in PHP before calculations - malicious schemas could potentially cause integer overflow

Performance Considerations

  1. PHP decimal encoding: Using pack('J') and string operations is efficient for small values but may not scale well for very large decimals. Consider documenting performance characteristics.

  2. C UTF-8 encoding: The double-pass approach (calculate size, then encode) is reasonable but could be optimized for common cases where no bytes > 0x7f exist.


Recommendations

Must Fix

  1. Fix the null pointer dereference risk in lang/c/src/value-json.c:65-76 for zero-length byte arrays

Should Fix

  1. Correct the decimal precision validation logic in PHP
  2. Add bounds checking for precision/scale parameters

Consider

  1. Use BC Math for decimal arithmetic in PHP for better precision
  2. Update Composer version specification to use semantic versioning
  3. Add more edge case tests for decimal boundaries
  4. Document performance characteristics of logical type encoding

Conclusion

This PR addresses real bugs and adds valuable functionality. The C fixes for null byte handling and empty arrays are critical for correctness. The PHP logical types support is well-implemented with good test coverage.

However, the potential null pointer dereference in the C code for zero-length arrays should be addressed before merging. The PHP decimal validation issue is less critical but should also be fixed.

Overall: Request Changes - Fix the critical C issue, then this will be ready to merge.


@augmentcode augmentcode bot left a comment

Review completed. 3 suggestions posted.

  }

- json_t *result = json_string_nocheck((const char *) encoded);
+ json_t *result = json_stringn_nocheck((const char *) encoded ? encoded : "", encoded_size);

This conditional cast applies only to the condition; as written, (const char *) encoded ? encoded : "" passes a pointer of incompatible type and may fail to compile. Consider casting the entire conditional result so the argument to json_stringn_nocheck is a const char * in both branches.
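
In C a cast binds more tightly than ?:, so the expression parses as ((const char *) encoded) ? encoded : "". A minimal sketch of one alternative places the cast on the non-NULL branch so that both branches already have type const char *. It assumes encoded is a uint8_t-style buffer pointer, which is not confirmed by the excerpt, and bytes_or_empty is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: give a string API a valid pointer even when the
 * encoded buffer is NULL (the zero-length case). The cast sits on the
 * non-NULL branch, so both branches of ?: have type const char *. */
static const char *bytes_or_empty(const uint8_t *encoded)
{
    return encoded ? (const char *) encoded : "";
}
```

The result can then be passed together with the explicit length, as in json_stringn_nocheck(bytes_or_empty(encoded), encoded_size).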


$scale = $logicalType->attributes()['scale'] ?? 0;
$precision = $logicalType->attributes()['precision'] ?? null;

$encoder->writeDecimal($datum, $scale, $precision);

For fixed decimals, Avro requires fixed-size two’s complement bytes without a length prefix; calling writeDecimal here writes a length-prefixed byte array. Consider writing exactly $writers_schema->size() bytes for fixed decimal (also applies to reading at AvroIODatumReader::readFixed where readBytes() is used at line 353).
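
The framing difference can be shown with a small C sketch, independent of the PHP classes. In Avro's binary encoding, bytes is a zigzag varint length followed by the payload, while fixed is the payload alone. The function names here are hypothetical and the output buffer is assumed to be large enough.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append an Avro long (zigzag + base-128 varint). Returns bytes written.
 * The caller must provide at least 10 bytes of space. */
static size_t put_long(uint8_t *out, int64_t value)
{
    uint64_t n = ((uint64_t) value << 1) ^ (value < 0 ? ~(uint64_t) 0 : 0);
    size_t i = 0;
    while (n & ~(uint64_t) 0x7f) {
        out[i++] = (uint8_t) ((n & 0x7f) | 0x80);
        n >>= 7;
    }
    out[i++] = (uint8_t) n;
    return i;
}

/* Avro `bytes`: length prefix, then the payload. */
static size_t put_bytes(uint8_t *out, const uint8_t *data, size_t len)
{
    size_t n = put_long(out, (int64_t) len);
    if (len > 0)
        memcpy(out + n, data, len);
    return n + len;
}

/* Avro `fixed`: exactly `size` payload bytes, no prefix. */
static size_t put_fixed(uint8_t *out, const uint8_t *data, size_t size)
{
    if (size > 0)
        memcpy(out, data, size);
    return size;
}
```

For a two-byte payload, put_bytes emits three bytes (the length 2 is zigzag-encoded as 0x04) and put_fixed emits two, which is why a reader expecting one framing misreads the other.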


);
}

$packed = pack('J', $value);

pack('J', $value) uses machine endianness; Avro decimal logical types (bytes/fixed) must use big-endian two’s complement. On little-endian platforms this will emit the wrong byte order, so consider producing big-endian bytes (or reversing before trimming) to ensure interoperable encoding.
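
For reference, the target layout (big-endian two's complement, trimmed to the fewest bytes that keep the sign) can be produced without depending on host byte order. The C sketch below uses a hypothetical function name. Separately, PHP's manual lists the J pack code as a 64-bit value in big-endian order, with Q as the machine-order variant, so the byte-order concern is worth confirming against the actual code path.

```c
#include <stddef.h>
#include <stdint.h>

/* Write `value` as big-endian two's complement using the fewest bytes
 * that still preserve the sign. `out` needs room for 8 bytes.
 * Returns the number of bytes written (1 to 8). */
static size_t encode_be_twos_complement(int64_t value, uint8_t out[8])
{
    uint8_t full[8];
    uint64_t bits = (uint64_t) value;
    size_t i, start = 0;

    /* Most significant byte first, independent of host byte order. */
    for (i = 0; i < 8; i++)
        full[i] = (uint8_t) (bits >> (56 - 8 * i));

    /* Drop leading bytes that only repeat the sign bit of the next byte. */
    while (start < 7) {
        uint8_t sign_fill = (full[start + 1] & 0x80) ? 0xff : 0x00;
        if (full[start] != sign_fill)
            break;
        start++;
    }

    for (i = start; i < 8; i++)
        out[i - start] = full[i];
    return 8 - start;
}
```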


@martin-augment
Owner Author

Closed this PR because the main branch was out of date and the PR included diffs from other PRs.
The new one is #15.

@claude

claude bot commented Oct 27, 2025

Code Review: AVRO-4193 - Support for Empty Byte Arrays

Summary

This PR fixes a critical bug where encoding empty byte arrays causes segfaults due to avro_malloc(0) returning NULL on some architectures. The fix correctly distinguishes between allocation failures and zero-size allocations.

Code Quality & Best Practices ✅

Strengths:

  1. Correct NULL check pattern: The changes consistently use if (!ptr && size) which properly handles:

    • Zero-size allocations (ptr=NULL, size=0) → Valid, continue execution
    • Failed allocations (ptr=NULL, size>0) → Error, return failure
  2. Comprehensive test coverage: Added test_empty_bytes() covering multiple scenarios:

    • Encoding/decoding empty bytes
    • JSON serialization
    • Equality checks
    • Setting empty bytes via avro_givebytes_set()
  3. Consistency: Applied the fix to all relevant locations (datum.c, value-json.c, test_avro_data.c)

Potential Issues 🔍

Critical: Potential UB with memcpy (HIGH PRIORITY)

Location: lang/c/src/datum.c:163, lang/c/src/datum.c:204

char *bytes_copy = (char *) avro_malloc(size);
if (!bytes_copy && size) {
    // error handling
}
memcpy(bytes_copy, bytes, size);  // ⚠️ ISSUE HERE

Problem: When size == 0, bytes_copy may be NULL (malloc(0) is allowed to return NULL), but the code still calls memcpy(NULL, bytes, 0). According to C11 §7.24.1¶2, calling memcpy() with NULL pointers is undefined behavior even when size is 0.

Recommendation: Add a guard before memcpy:

if (size > 0) {
    memcpy(bytes_copy, bytes, size);
}

This affects:

  • lang/c/src/datum.c:163 in avro_bytes()
  • lang/c/src/datum.c:204 in avro_bytes_set()
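
Put together, the allocation check and the guarded copy might look like the sketch below. It uses plain malloc and a hypothetical copy_bytes function as a stand-in; it is not the code in datum.c.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the copy made when building a bytes datum.
 * On success returns 0 and stores the copy (possibly NULL for size 0)
 * in *copy_out; returns -1 only on a real allocation failure. */
static int copy_bytes(const char *bytes, size_t size, char **copy_out)
{
    char *copy = malloc(size);
    if (copy == NULL && size != 0)
        return -1;                  /* genuine out-of-memory */

    if (size > 0)
        memcpy(copy, bytes, size);  /* never reached with a NULL pointer */

    *copy_out = copy;
    return 0;
}
```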

Medium: Test allocator logic improvement

Location: lang/c/tests/test_avro_data.c:43-48

The fix adds a NULL check before dereferencing ptr, which is good. However, the test allocator should document the behavior that NULL is a valid return for size=0 allocations.

Current code:

if (nsize == 0) {
    if (ptr) {
        // free logic
    }
    return NULL;
}

This correctly handles free(NULL), which is good defensive programming. ✅
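
A realloc-style allocator callback with this contract can be sketched as follows. The four-argument shape (user data, pointer, old size, new size) is an assumption based on the excerpt above, and sketch_alloc is a hypothetical name; the real test allocator does additional bookkeeping that is not reproduced here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical realloc-style allocator: nsize == 0 means "free",
 * and NULL is the normal result for that request. */
static void *sketch_alloc(void *user_data, void *ptr,
                          size_t osize, size_t nsize)
{
    (void) user_data;
    (void) osize;

    if (nsize == 0) {
        /* ptr may itself be NULL when the block being released came
         * from a zero-size allocation. */
        if (ptr != NULL)
            free(ptr);
        return NULL;
    }
    return realloc(ptr, nsize);
}
```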

Minor: Schema decref duplication

Location: lang/c/tests/test_avro_data.c:230, lang/c/tests/test_avro_data.c:243

avro_schema_decref(writer_schema);  // Line 230
// ... more code ...
avro_schema_decref(writer_schema);  // Line 243 - duplicate?

The schema is decref'd twice in the test. This appears to be correct since there are two separate test sections, but verify the reference counting is balanced.

Performance Considerations ⚡

  • Minimal impact: The added checks (&& size) are simple integer comparisons with negligible overhead
  • No new allocations: The fix only changes error handling logic
  • Zero-size optimization: Avoids unnecessary error messages for legitimate empty byte arrays

Security Concerns 🔒

Positive Changes:

  1. Prevents NULL dereference: The fix prevents crashes when handling empty byte arrays
  2. Proper error differentiation: Distinguishes allocation failures from valid zero-size cases

Remaining Concerns:

  1. NULL deref in JSON encoding (Fixed ✅): Line 129 in value-json.c properly handles:

    json_stringn_nocheck((const char *) encoded ? encoded : "", encoded_size);

    Good defensive programming - uses empty string when encoded is NULL.

  2. Input validation: The code assumes bytes parameter is valid (not NULL) when size > 0. This should be documented or validated if it's part of the public API.

Test Coverage 📊

Excellent Coverage ✅:

  • Empty array creation: avro_givebytes(bytes, sizeof(bytes), NULL) where sizeof=0
  • Serialization round-trip: write_read_check()
  • JSON encoding: test_json(datum, "\"\"")
  • Mutation: avro_givebytes_set() with empty arrays
  • Equality comparison: avro_datum_equal()

Suggestions:

  1. Add test for avro_bytes() with empty array (currently only tests avro_givebytes())
  2. Add test for avro_bytes_set() with size=0
  3. Consider fuzzing with various edge cases (NULL bytes with size=0, etc.)

Recommendations

Must Fix:

  1. ❗ Add memcpy guards to prevent UB with NULL pointers

Should Consider:

  1. Document the behavior of avro_malloc(0) in the codebase
  2. Add API documentation clarifying when NULL is acceptable for bytes parameter

Nice to Have:

  1. Expand test coverage to include avro_bytes() and avro_bytes_set() variants
  2. Add comments explaining the && size checks for future maintainers

Verdict

Approve with changes requested

The fix correctly addresses the segfault issue and includes good test coverage. However, the memcpy undefined behavior should be fixed before merging to ensure full C standard compliance.

Great work on identifying and fixing this edge case! The test coverage demonstrates thorough thinking about the problem.

@github-actions

Findings

  • Major – lang/php/lib/Datum/AvroIOBinaryEncoder.php:192 / lang/php/lib/Datum/AvroIODatumReader.php:513: writeDecimal packs integers with 'J', treats byte 0 as the most significant bit, and readDecimal unpacks with 'J'. 'J' is machine-endian, so on little-endian platforms we emit/expect little-endian two’s-complement instead of the big-endian form Avro specifies. Other runtimes will misinterpret these decimals (and the MSB trimming logic becomes incorrect). Please switch to an explicit big-endian two’s-complement encoding/decoding (or reuse the existing long/bytes helpers) before trimming/padding.
  • Major – lang/php/lib/Datum/AvroIODatumReader.php:247 & 344: logical-type compatibility is checked with !==. Because logicalType() returns distinct objects for writer and reader schemas that merely share the same contents, identical logical types will still compare as different objects and trigger AvroIOSchemaMatchException. Compare by logical-type name/attributes (or normalize the instances) instead of relying on object identity.
  • Major – lang/php/lib/Datum/AvroIODatumWriter.php:190 / lang/php/lib/Datum/AvroIODatumReader.php:349: fixed decimals are going through writeDecimal/readBytes, which add/expect a bytes length prefix. Avro fixed values must be written and read as raw bytes of exactly the fixed size with no preceding length. This currently corrupts the binary stream. Encode to the fixed length (padding as required) and call $encoder->write(...), and read with $decoder->read($writers_schema->size()) before handing the bytes to the decimal converter.
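
The "padding as required" step for fixed decimals amounts to sign extension. A short C sketch follows, assuming the value is already in big-endian two's-complement form; pad_to_fixed is a hypothetical name and the caller supplies an output buffer of fixed_size bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Left-pad a big-endian two's-complement value to exactly `fixed_size`
 * bytes by repeating the sign byte. Returns 0 on success, -1 if the
 * value is empty or does not fit. */
static int pad_to_fixed(const uint8_t *value, size_t value_len,
                        uint8_t *out, size_t fixed_size)
{
    if (value_len == 0 || value_len > fixed_size)
        return -1;

    uint8_t fill = (value[0] & 0x80) ? 0xff : 0x00;  /* sign extension */
    size_t pad = fixed_size - value_len;

    memset(out, fill, pad);
    memcpy(out + pad, value, value_len);
    return 0;
}
```

The padded buffer is then written as raw bytes with no length prefix.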

Let me know if you’d like help patching these spots or adding regression coverage.
